Introduction
Deploying LLM-powered systems in production is the easy part. The hard part? Making sure they’re actually working.
I’ve been deploying LLM-powered systems in production across many companies and industries for almost 3 years now. Each time, I’ve run into the same critical challenge: how do you truly evaluate whether your LLM is performing well, rather than relying on vibe checks alone?
If you’re struggling to move beyond “it looks good to me” when evaluating your LLM applications, this blog post is for you.

This blog post contains lessons learnt through hands-on experience: deploying real-life LLM applications, talking with peers, taking courses, and reading blog posts.
Beyond my personal experience, Hamel Husain & Shreya Shankar’s course and blogs on LLM evaluation have been a great help to me, and many of the techniques I discuss here are either directly quoted from their work or heavily inspired by it.
Each of the lessons below will tackle a specific part of building and evaluating real LLM applications.
Pre-lesson
I can’t emphasize this enough, and even though you’ve probably heard it a million times before, I’ll say it again: KNOW YOUR PRODUCT AND YOUR USERS.
If you don’t know your users and your product well, you won’t be able to understand the different ways they’ll interact with your system, the different types of queries they’ll make, and the different ways they can express the same intent. This is going to be a major obstacle whether you want to create synthetic data, define evaluation metrics, or interpret the results of your evaluations.
Now that this is said, let’s dive into the lessons!
Lesson 1: I’ll vibe-check my app
Here’s a common scenario I’ve encountered many times: a company builds a RAG system over their product documentation, and naturally wants to evaluate and improve it. But there’s a pattern I keep seeing: companies expecting AI to “just work” out of the box.
This expectation shows up most clearly in evaluation practices. Companies often rely on “vibe checks”: manually asking a dozen questions and eyeballing whether the system’s answers seem reasonable.
This is a terrible way to evaluate your LLM applications for multiple reasons:
- The queries you ask are biased towards what you think the system should be able to do, or towards the specific range of queries you expect the final user to enter. You will miss most of the cases and failure modes you didn’t think about. From first-hand experience, users will surprise you both with how they write their queries and with the types of queries they enter. Expect them to write as if they’re in a rush 😅
- Not evaluating your app in a systematic and/or scalable way will leave you playing “whack-a-mole” (as Hamel so beautifully puts it), where you fix one failure by tweaking a prompt only to have another pop up somewhere else. This leads to frustration for your stakeholders, frustration for the team working on the app, and maybe even a lack of trust in LLMs within your organization.

Lesson 2: I don’t have any data to test my application, where do I start?
Once you develop your LLM-powered app, you are faced with a “Which comes first, the chicken or the egg?” dilemma:
- You don’t have real user query data because you haven’t deployed your app yet
- You can’t deploy your app because you haven’t tested it with real user queries yet
Sub-lesson 1: Real data is better than synthetic data
You can almost always get some pseudo-real data. If you can’t get access to beta users, ask your teammates to test the system. They will very probably have biases of their own, but at least you will get data that is not completely synthetic and that has different characteristics because it comes from different people.
Sub-lesson 2: The bad way to create synthetic data
Real data is always better than synthetic data. But hey, if you really can’t get any real data, then synthetic data is the way to go.
The mistake most people make when creating synthetic data is to ask an LLM to generate queries similar to what they expect users to enter. This is a bad idea: the generated queries will be biased towards what you think users will enter and will likely miss many failure modes. Most importantly, the generated queries will likely be “too good” and not representative of real user queries (messy, misspelled, incomplete…).
I’ve trained a retriever in the past on synthetic data generated this way. While the performance on synthetic-like queries was really good, the performance on real user queries was really bad. The gap between synthetic data and real user data was just too big.
Sub-lesson 3: The good way to create synthetic data
Think in terms of dimensions of variability in user queries: an approach I learnt from Hamel Husain & Shreya Shankar is to first think about the different dimensions along which user queries vary for your specific application. For example, if you’re building a RAG system over technical product documentation, the dimensions could include:
- User type (user, admin, developer, etc.)
- Intent (seeking information, troubleshooting, feature requests, etc.)
- User expertise level (beginner, intermediate, expert, etc.)
- Query length (short queries, long queries, etc.)
- Query complexity (simple queries, complex queries with multiple sub-questions, etc.)
- Query style (formal, informal, typos, etc.)
- etc.
As queries always depend on the context of the application and the personas of its users (a wink 😉 to the pre-lesson above), your dimensions should be specific to your application, not general ones someone else used in another context.
Then, for each dimension, you can brainstorm different values that the dimension can take (or delegate the task of brainstorming values of some dimensions to an LLM). For example, for the “user expertise level” dimension, you can have the values: “beginner”, “intermediate”, “expert” as shown above.
Once the list of dimensions and their possible values is ready, you can start combining them to create tuples that will represent different synthetic queries.
Here are some examples of tuples representing combinations of the dimensions mentioned above.
("end_user", "troubleshoot", "beginner", "short", "simple", "typos")
("developer", "integration_info", "expert", "long", "complex", "formal")
("admin", "permissions_help", "intermediate", "short", "simple", "informal")
("end_user", "feature_discovery", "beginner", "short", "simple", "incomplete")
("support_engineer", "root_cause_analysis", "expert", "long", "complex", "dense")
("end_user", "account_status", "beginner", "short", "simple", "mixed_case")
("developer", "performance_optimization", "expert", "medium", "complex", "typos")
("admin", "audit_logging", "intermediate", "medium", "moderate", "formal")
("end_user", "error_meaning", "beginner", "short", "simple", "abbreviations")
Each tuple becomes a prompt seed you can use to generate multiple queries from 🚀
And there you have it: a systematic way to create synthetic data that covers a wide range of possible user queries for your specific application.
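To make this concrete, here’s a minimal sketch of turning dimension tuples into prompt seeds. The dimensions, values, and prompt wording below are illustrative placeholders; yours should come from knowing your own product and users:

```python
import itertools
import random

# Illustrative dimensions and values -- yours should come from knowing
# your own product and users, not from this list.
dimensions = {
    "user_type": ["end_user", "admin", "developer"],
    "intent": ["troubleshoot", "integration_info", "feature_discovery"],
    "expertise": ["beginner", "intermediate", "expert"],
    "length": ["short", "long"],
    "complexity": ["simple", "complex"],
    "style": ["formal", "informal", "typos", "incomplete"],
}

# The full cross-product explodes quickly, so sample a manageable number of tuples.
all_tuples = list(itertools.product(*dimensions.values()))
seeds = random.sample(all_tuples, k=50)

PROMPT_TEMPLATE = """You are simulating a user of <your product> asking its documentation assistant a question.
User type: {user_type}
Intent: {intent}
Expertise level: {expertise}
Query length: {length}
Query complexity: {complexity}
Query style: {style}
Write 3 realistic queries this user might type. Keep them messy if the style calls for it."""

def build_prompts(seed_tuples):
    """Turn each tuple of dimension values into a generation prompt."""
    prompts = []
    for values in seed_tuples:
        prompts.append(PROMPT_TEMPLATE.format(**dict(zip(dimensions, values))))
    return prompts

# Send each prompt to whatever LLM client you use and collect the generated queries.
prompts = build_prompts(seeds)
print(prompts[0])
```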
Lesson 3: Have a systematic way to identify failure modes
When you start evaluating your LLM applications, you need a systematic way to identify failure modes. Without one, you end up in the same “whack-a-mole” situation mentioned earlier, where you fix one failure mode only to have another pop up somewhere else.
You can identify failure modes using synthetic data (created as explained in Lesson 2) or real user queries (if you have access to them) even before deploying your app in production.
Sub-lesson 1: The three steps for error analysis
In short, the steps are:
- Generate your system’s answers for the queries you have (real or synthetic), and gather the execution trace for each query.
- Open-coding: note down the first failure that appears in each trace (if present).
- Axial coding: cluster the failures into families of failure modes.
These three steps are the key to a successful error analysis. Since I discovered this approach in Hamel & Shreya’s course, it has become my go-to method for identifying failure modes in LLM applications in a systematic way.
Traces are king
A trace is a detailed log of the execution of your LLM application for a query. It includes all the intermediate steps, LLM calls, tool calls, and final output.
Now that you have a set of user queries (real or synthetic), run each query through your LLM application and capture the full trace of the execution, including all the intermediate steps, LLM calls, tool calls, and the final output.
You should go through a good number of queries to cover their diversity (usually around 100, but it really depends on the complexity of your application and the diversity of your user queries; it could be more or less than that).
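If your stack doesn’t already capture traces for you (most observability vendors do), here’s a minimal sketch of what a hand-rolled trace record could look like; the structure is an assumption, not a standard:

```python
import json
import time
import uuid
from dataclasses import dataclass, field, asdict

@dataclass
class TraceStep:
    kind: str          # "llm_call", "tool_call", "retrieval", ...
    name: str          # e.g. the model or tool name
    inputs: dict
    output: str
    duration_s: float

@dataclass
class Trace:
    query: str
    trace_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    steps: list = field(default_factory=list)
    final_answer: str = ""

    def log_step(self, kind: str, name: str, inputs: dict, output: str, started_at: float):
        # Call this right after every LLM call, tool call, retrieval step, etc.
        self.steps.append(TraceStep(kind, name, inputs, output, time.time() - started_at))

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2)

# Usage: create one Trace per query, log every intermediate step, set final_answer,
# then persist to_json() somewhere you can comfortably read it during error analysis.
```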
Open-coding
Once you have the traces, go through them one by one and note down the first failure that appears in each (if present). The description of the failure should be specific. For example: “Didn’t call the call_customer_tool even though the customer asked the system to do so”.
Axial coding
Now that you have annotated the failures in your dataset, the next step is to group them into clusters of failure modes. You can do this:
- manually, by going through the list of failures and grouping them into families of failure modes
- with an LLM’s help: provide it with the list of failures and ask it to group them into families of failure modes. This is usually my starting point as it saves a lot of time; I then go through the clusters and refine them if needed.
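Here’s a minimal sketch of that LLM-assisted first pass, taking the open-coded failure notes as input. It assumes the openai Python client and an illustrative model name; swap in whatever you actually use:

```python
# A first pass at axial coding with an LLM. Assumes the `openai` package and an
# OPENAI_API_KEY in the environment; swap in whatever client you actually use.
from openai import OpenAI

client = OpenAI()

# Output of the open-coding step: one short, specific note per failing trace.
failure_notes = [
    "Didn't call the call_customer_tool even though the customer asked the system to do so",
    "Retrieved the v1 API docs when the user explicitly asked about v2",
    "Answer invented a configuration flag that doesn't exist in the docs",
    # ... the rest of your annotated failures
]

prompt = (
    "Below is a list of failure notes from an LLM application, one per line.\n"
    "Group them into families of failure modes. For each family, give a short name, "
    "a one-sentence description, and the indices of the notes it contains.\n\n"
    + "\n".join(f"{i}. {note}" for i, note in enumerate(failure_notes))
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative choice
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
# Treat this output as a starting point only: review the clusters and refine them by hand.
```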
Now, here’s a diagram that summarizes this 3-step process:
What comes next ?
Prioritization: these clusters of failure modes will help you identify exactly which parts of your system are responsible for failures, and quantify how many failures are due to each failure mode.
Iteration: once you have identified and prioritized the failure modes, you can start fixing them. After fixing them, go through the same process again to identify the failure modes that remain. It’s important to understand that error analysis is an iterative process; expect to do it two or three times at first.
Scale your evals in production
You might ask yourself whether you’ll have to keep doing this manually over production data. The answer is that once you have a good understanding of the failure modes and have fixed the most critical ones, you can scale your evaluation by running a custom LLM-as-a-judge over large samples of production queries.
Here’s how to do it:
- Get the traces you’ve previously labeled as successful or failed and split them into a train, a dev, and a test set.
- Create a custom LLM-as-a-judge for each failure mode:
- For each failure mode, create a prompt that describes it in a very specific way. Don’t be broad in your description and don’t use vague words (like “good”, etc.).
- Provide the LLM with examples of both successful and failed queries from your train set to help it understand the failure mode better.
- Evaluate the performance of the LLM-as-a-judge on your dev set and iterate on the prompt until you reach satisfactory performance.
- Once you’re satisfied with the performance on the dev set, test the LLM-as-a-judge on your test set to get an estimate of its performance on unseen data. This should give you a good idea of how well the LLM-as-a-judge will perform on production data.
Now, this LLM-as-a-judge can be used to evaluate large samples of production queries and identify the failure modes that are still present in your system.
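To make this concrete, here’s a minimal sketch of a judge for a single failure mode, plus a check of how well it agrees with your human labels. It assumes an OpenAI-style client; the failure mode description, model name, and data format are illustrative, not a prescription:

```python
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are checking for one specific failure mode:
"The assistant does not call the call_customer_tool when the user explicitly asks to be called back."

Labeled examples:
{few_shot_examples}

Now judge the following trace. Answer with exactly PASS or FAIL.
Trace:
{trace}
"""

def judge(trace_text: str, few_shot_examples: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative choice
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            few_shot_examples=few_shot_examples, trace=trace_text)}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

def agreement(dev_set: list, few_shot_examples: str) -> float:
    """dev_set: list of (trace_text, human_label) pairs with labels 'PASS'/'FAIL'."""
    hits = sum(judge(trace, few_shot_examples) == label for trace, label in dev_set)
    return hits / len(dev_set)

# Iterate on JUDGE_PROMPT until agreement on the dev set is acceptable,
# then report the score on the held-out test set once.
```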
Sub-lesson 2: Clustering production queries didn’t work for me
One approach that I tested in the past (for about a year) and that didn’t work that well for me is to define a set of classes that I expect user queries in my app to fall into. Identifying these classes helps in knowing which queries the model is not handling well and which classes of queries need more attention.
On top of the identified classes, I would create an “other” category for queries that don’t fall into any of them. Analyzing the queries that landed in the “other” category helped me identify new classes I hadn’t thought about initially and better pinpoint how the model performs on different types of queries.
At the time, I used GPT-4 to do the classification. What didn’t work well for me was that most queries ended up in the “other” category, defeating the purpose of the classification. The same pattern showed up with a couple of other decoder LLMs.
While I thought about finetuning a BERT-like model to do the classification, having to finetune a new model whenever a new class is identified seemed like overkill and a maintenance nightmare.
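For concreteness, this is roughly the shape of the classification setup described above (class names and model are illustrative); as mentioned, it underperformed for me because too much fell into “other”:

```python
# Rough shape of the query-classification setup; class names and the model are
# illustrative, and this is the approach that didn't pan out for me.
from openai import OpenAI

client = OpenAI()

CLASSES = ["billing_question", "troubleshooting", "how_to", "feature_request", "other"]

CLASSIFY_PROMPT = (
    "Classify the user query into exactly one of these classes: "
    + ", ".join(CLASSES)
    + ".\nAnswer with the class name only.\n\nQuery: {query}"
)

def classify(query: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # I originally used GPT-4; any decoder LLM fits here
        messages=[{"role": "user", "content": CLASSIFY_PROMPT.format(query=query)}],
        temperature=0,
    )
    label = response.choices[0].message.content.strip()
    return label if label in CLASSES else "other"
```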
Lesson 4: Off-the-shelf evals don’t work that well
Off-the-shelf evals don’t work that well. In the different frameworks’ and vendors’ interfaces for evaluating LLM applications, you’ll stumble upon two kinds of metrics:
LLM-based metrics: These metrics use an LLM to evaluate the quality of the output based on some criteria (e.g., relevance, coherence, etc.). While they can be useful, the prompt behind them is generic and not at all tailored to your specific application and its requirements. Think of LLM-as-a-judge for a second: you want the judge to have ample context about your application and the specific criterion you want to evaluate. A generic prompt won’t cut it, and the same goes for generic LLM-based metrics.
Non-LLM-based metrics: These metrics are usually quite generic (e.g., BLEU, ROUGE, etc.) and don’t capture how well your application is performing from a business standpoint.
On this page, you can find many examples of both families of metrics.
These non-custom metrics will tell you if your system is terrible, but they aren’t much use for driving incremental improvements. Many times, I’ve seen an incremental improvement move the LLM-based metrics about as much as simply rerunning the code with the same prompt and the same settings.
Now, all the vendors I’ve worked with allow you to create custom metrics (usually called scorers). So you can use the vendor’s API to define your own custom metric in code and have it integrated into their evaluation interface.
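The exact API differs from vendor to vendor, so here’s just the general shape of a custom, business-specific scorer as a plain Python function; the field names and the check itself are illustrative:

```python
# The registration API differs per vendor, but most custom scorers boil down to a
# plain function over an example/trace that returns a score and a reason.
# Field names below are assumptions about your own data, not a vendor schema.
from dataclasses import dataclass

@dataclass
class ScoreResult:
    name: str
    passed: bool
    reason: str

def cites_retrieved_docs(example: dict) -> ScoreResult:
    """Business-specific check: the answer must cite at least one retrieved doc ID."""
    answer = example["output"]
    doc_ids = [doc["id"] for doc in example.get("retrieved_docs", [])]
    cited = any(doc_id in answer for doc_id in doc_ids)
    return ScoreResult(
        name="cites_retrieved_docs",
        passed=cited,
        reason="Answer cites a retrieved doc" if cited else "No retrieved doc ID found in the answer",
    )

# Register this function as a custom scorer/metric in whatever evaluation
# interface you use; the wiring varies, the idea doesn't.
```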
Lesson 5: The evaluation interface matters
When using an interface to evaluate your LLM applications, the interface matters a lot. A good interface should in general empower you to spot the failure modes and annotate your data easily.
I’ve worked with many vendors providing evaluation interfaces for LLM applications. Their interfaces are quite similar and are actually pretty good for hitting the ground running. They let you go through the traces of your LLM applications, see the inputs and outputs, and annotate them with a few clicks. However, what I usually find lacking is the following:
No dedicated field for failure-mode annotation: the thing that slows me down most is not having a specific field for annotating failure modes. It hurts most when a failure mode has already appeared in the data but I have to write it down again and again for each new trace I see. A dedicated field, or at least the most recurrent failure modes offered as selectable options, would speed up the annotation process a lot.
Formatting 💄 is not customizable: imagine you have an app for writing emails. When doing your error analysis, you want to see the email formatted as it would appear in an email client (with subject, greeting, body, signature, etc.). This lets you easily spot failure modes instead of having to look at JSON or a plain blob of text (which is the case with all the interfaces I know). The absence of appropriate formatting slows down the error analysis process a lot.
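As a stopgap, I sometimes render the traces myself before reviewing them. Here’s a minimal sketch for the email example above; the field names assume a hypothetical trace schema:

```python
# Stopgap when the interface only shows raw JSON: render the output the way a
# reviewer would actually read it. Field names assume a hypothetical trace schema.
import json

EMAIL_TEMPLATE = """Subject: {subject}

{greeting}

{body}

{signature}
"""

def render_email_trace(trace_json: str) -> str:
    output = json.loads(trace_json)["final_output"]
    return EMAIL_TEMPLATE.format(
        subject=output.get("subject", "(missing subject)"),
        greeting=output.get("greeting", ""),
        body=output.get("body", ""),
        signature=output.get("signature", ""),
    )

# Dump the rendered emails into a Markdown/HTML file and review those during
# error analysis instead of squinting at nested JSON.
```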
If you want to know more about this topic, I recommend this write-up of a talk by Christopher Lovejoy on building LLM-native apps in vertical industries.
Final notes
Don’t hesitate to reach out to me on LinkedIn or via email (chsafouane@gmail.com) if you’d like to talk about this topic or if you have any questions. My company Lumiereai also offers consulting services on this topic.
This blog post is not exhaustive. Just before publishing it, I noticed that Hamel Husain has published a very extensive blog post on the same topic with many more lessons learnt. I highly recommend you read it as well. He’s the most knowledgeable person I know on this topic and I’ve learnt a lot from his work.
I wasn’t paid at all to say this, and it’s really personal advice: if you have the money or your company can pay for it, I really advise you to take Hamel & Shreya’s course on LLM evals. It’s worth every penny and will save you a lot of time and effort in the long run. Otherwise, they are writing a book on the same topic with O’Reilly. Keep an eye on it.